The goal of this notebook is to identify other clusters of mutations with same approach used to define the mutations in Genome-1 and Genome-2. Hopefully, these can be used to establish subclonal haplotypes on the background of Genome-1 and Genome-2.

The inputs for this analysis are (1) the filtered and combined variants annotated by major genotypes from identify-backgrounds.Rmd and (2) annotations of the positions of the main Measles open reading frames.

## ==== File paths input data ==== ##

# Data from lofreq and varscan labeled with G1 and G2 mutations
# labeled.data = "../../results/variants/genotyped_variants.csv"
labeled.data = snakemake@params[['incsv']]

# Annotations 
# annotations.filepath = "../../config/annotations.csv"
annotations.filepath = snakemake@params[['annotations']]

The output of this notebook are the filtered variants further annotated with clusters of mutations that make up probable subclonal haplotypes.

## ==== File paths output data ==== ##

# Path to save the results
# output.path = "../../results/variants/clustered_variants.csv"
output.path = snakemake@params[['outcsv']]

Identify Subclonal Haplotypes

Since the goal of this notebook is use the same technique that I used to identify Genome-1, Genome-2, and Genome-1-1, but for mutations that aren’t necessarily found in every single tissue, I need to account for missing mutations. To do this, I spread the data frame of frequencies and assign an allele frequency of 0 if a variant is missing from a tissue. This will allow us to cluster all of the mutations.

# Read in the variants with G1, G2, and G1-1 labeled
labeled.df = read_csv(labeled.data, show_col_types = FALSE)

haplotypes.label = labeled.df %>% 
  select(SNP, Haplotype, Background) %>% 
  distinct()

# Expand the mutations to have a frequency for every tissue.
expanded.df = labeled.df %>% 
  select(SNP, Tissue, AF) %>% 
  pivot_wider(names_from = "Tissue", values_from = "AF", values_fill = 0) %>% 
  pivot_longer(cols = !SNP, names_to = "Tissue", values_to = "AF") %>% 
  left_join(., select(labeled.df, c("SNP", "Tissue", "DP")), by = c("SNP", "Tissue")) %>% 
  mutate(DP = if_else(is.na(DP), 0, DP)) %>% 
  left_join(., haplotypes.label, by = "SNP") 

# Get the mean frequency of the major genomes 
tissue.mean = labeled.df %>% 
  filter(Haplotype %in% c("genome-1", "genome-2")) %>% 
  group_by(Tissue, Haplotype) %>% 
  summarize(AF.mean = mean(AF, na.rm = TRUE), 
            SD = sd(AF, na.rm = TRUE),
            N = n()) %>% 
  mutate(SE = SD / sqrt(N),
         Lower.CI = qt(1 - (0.05 / 2), N - 1) * SE,
         Upper.CI = qt(1 - (0.05 / 2), N - 1) * SE) %>% 
  rename("AF" = AF.mean)

# Tissue order - roughly by the location in the brain 
tissue_order = c(
  "SSPE 1", 
  "SSPE 2",
  "Frontal Cortex 2", 
  "Frontal Cortex 1", 
  "Frontal Cortex 3", 
  "Parietal Lobe",
  "Temporal Lobe", 
  "Occipital Lobe",
  "Hippocampus",
  "Internal Capsule", 
  "Midbrain", 
  "UBS",
  "Brain Stem",
  "Cerebellum",
  "Cerebellum Nucleus"
)

I’ll write the approach I used in identify-backgrounds.Rmd into a reusable function. I can provide a list of SNPs, the full data frame, and the number of clusters I’m targeting.

cluster.snps <- function(list.of.snps, snp.df, n.clusters) {
  
  # SNP as column and Allele frequency as row
  frequency.by.snp = snp.df %>% 
    filter(SNP %in% list.of.snps) %>% 
    select(AF, SNP, Tissue) %>% 
    pivot_wider(names_from = SNP, values_from = AF, values_fill = 0) %>% 
    select(!Tissue) 
  
  # Calculate R between every pair of columns while handling NAs 
  snp.correlation = cor(frequency.by.snp, use="pairwise.complete.obs")
  
  # Convert to distance (positive corr is close to 0)
  snp.dist = as.dist(1 - snp.correlation)
  
  # Create k-medoids clustering with n clusters
  snp.kmedoids = pam(snp.dist, n.clusters) 
  kclusters = snp.kmedoids$cluster
  
  # Convert to a data frame
  kclusters.df = data.frame(SNP = names(kclusters), cluster = kclusters)

  # Assign the clusters to the original data frame
  kmedoids.SNPs = snp.df %>% 
    filter(SNP %in% list.of.snps) %>% 
    left_join(., kclusters.df, by = "SNP") %>% 
    mutate(cluster = if_else(is.na(cluster), "no cluster", as.character(cluster)))
  
  n.clusters.per.snp = kmedoids.SNPs %>% 
    select(SNP, cluster) %>%
    distinct() %>% 
    group_by(cluster) %>% 
    count() %>% 
    mutate(cluster_size = paste0(cluster, " (", n, " snps in cluster)"))
  
  return(left_join(kmedoids.SNPs, n.clusters.per.snp, by = "cluster"))
  
}

Although this approach works reasonably well, it’s very susceptible to noise. This means that you can’t just provide all the variants at once and expect the algorithm to partition them into correct clusters. Instead, the best approach I found was to break variants down into bins based on their frequency and cluster these separately. The higher average frequency, the less noise, and the better the algorithm is at partitioning the variants into meaningful clusters.

>= 25% Allele Frequency Mutations

First, I’ll start with a ‘high-frequency’ bin. This includes all variants that reach greater than or equal to 25% in at least one tissue.

# SNPs that are above 25% at some point in some tissue
twentyfive.percent.or.more.snps = expanded.df %>% 
  filter(Haplotype == "subclonal") %>% # Not including the mutations we already haplotyped
  filter(AF >= 0.25) %>%
  pull(SNP) %>% 
  unique(.)

# Cluster the SNPs
twentyfive.percent.or.more.clusters.df = cluster.snps(list.of.snps = twentyfive.percent.or.more.snps, 
                                                      snp.df = expanded.df, 
                                                      n.clusters = 20)

# Mean of each tissue ~ for plotting
twentyfive.percent.or.more.clusters.mean = twentyfive.percent.or.more.clusters.df %>% 
  group_by(Tissue, cluster, cluster_size, n) %>% 
  summarize(AF = mean(AF))
## `summarise()` has grouped output by 'Tissue', 'cluster', 'cluster_size'. You
## can override using the `.groups` argument.
# Plot the clusters
twentyfive.percent.or.more.clusters.df %>% 
    ggplot(aes(x = factor(Tissue, levels = tissue_order), y = AF)) +
      geom_line(aes(group = SNP)) +
      geom_line(data = twentyfive.percent.or.more.clusters.mean, aes(x = Tissue, y = AF, group = 1, col = (cluster)), size = 1) +
      geom_ribbon(data = tissue.mean, aes(x=Tissue, ymin=AF-SD, ymax=AF+SD, group = Haplotype, fill = Haplotype), alpha=0.2, colour = NA) +
      facet_wrap(~cluster_size) + 
      scale_fill_manual(values=c("#424ef5", "#cf1919")) + 
      xlab("Tissue") + 
      ylab("Allele Frequency") +
      labs(fill="Haplotype", col = "Cluster") +
      theme_bw(20) +
      theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + 
      theme(strip.text.x = element_text(size = 12))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.

Some clusters have only a single mutation in them. Clearly, some of these belong to Genome-1, Genome-2, and Genome-1-1, but are missing from a single tissue, perhaps due to low coverage in that particular tissue.

high.frequency.not.clustered.snps = twentyfive.percent.or.more.clusters.df %>% 
  filter(n == 1) %>% # clusters that have only a single SNP in them
  pull(SNP)
  
expanded.df %>% 
  filter(SNP %in% high.frequency.not.clustered.snps) %>% 
  left_join(., distinct(select(labeled.df, SNP, Gene_Name, AA_Change))) %>% 
  mutate(mutation.type = paste(Gene_Name, SNP, sep = "-")) %>% 
    ggplot(aes(x = factor(Tissue, levels = tissue_order), y = AF)) +
      geom_line(aes(group = SNP)) +
      geom_ribbon(data = tissue.mean, aes(x=Tissue, ymin=AF-SD, ymax=AF+SD, group = Haplotype, fill = Haplotype), alpha=0.2, colour = NA) +
      facet_wrap(~mutation.type) + 
      scale_fill_manual(values=c("#424ef5", "#cf1919")) + 
      xlab("Tissue") + 
      ylab("Allele Frequency") +
      labs(fill="Haplotype", col = "Cluster") +
      theme_bw(20) +
      theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + 
      theme(strip.text.x = element_text(size = 12))
## Joining with `by = join_by(SNP)`

Cluster 18 (L-A15084G) is a mutation in Genome-1-1 that was too low frequency to be called in the Frontal Cortex 2 sample. Cluster 6 (M-C4502T) is mutation that’s part of Genome-2 but was too low coverage in the Cerebellum. Finally, Cluster 9 (M-C4573T) is a mutation in Genome-1 that was also too low coverage in the Cerebellum. Generally, the Cerebellum has the lowest coverage of any tissue, so this isn’t surprising. All-in-all, this adds three mutations to already existing haplotypes. I already addressed F-C7036T, N-G810A, and H-T7293C in the identify-backgrounds.Rmd notebook. Finally, M-C4532T might also belong as part of the reference?

I’ll annotate these variants in the main data frame.

# Clusters that are clearly genome-1 
genome.1.missing.snps = twentyfive.percent.or.more.clusters.df %>% 
  filter(cluster %in% c(8)) %>% 
  pull(SNP) %>% 
  unique(.)

# Clusters that are clearly genome-1-1
genome.1.1.missing.snps = twentyfive.percent.or.more.clusters.df %>% 
  filter(cluster %in% c(14)) %>% 
  pull(SNP) %>% 
  unique(.)

# Clusters that are clearly genome-2 
genome.2.missing.snps = twentyfive.percent.or.more.clusters.df %>% 
  filter(cluster %in% c(6)) %>% 
  pull(SNP) %>% 
  unique(.)

# Clusters that break the rules 
maybe.in.both.snps = twentyfive.percent.or.more.clusters.df %>% 
  filter(cluster %in% c(12, 11, 2, 7)) %>% 
  pull(SNP) %>% 
  unique(.)


# Annotate the labeled data frame with what's possible here
updated.haplotype.df = labeled.df %>%
  mutate(Background = case_when(SNP %in% genome.1.missing.snps ~ "genome-1",
                                SNP %in% genome.1.1.missing.snps ~ "genome-1",
                                SNP %in% genome.2.missing.snps ~ "genome-2",
                                SNP %in% maybe.in.both.snps ~ "both",
                                TRUE ~ Background)) %>% 
  mutate(Haplotype = case_when(SNP %in% genome.1.missing.snps ~ "genome-1",
                               SNP %in% genome.1.1.missing.snps ~ "genome-1-1",
                               SNP %in% genome.2.missing.snps ~ "genome-2",
                               SNP %in% maybe.in.both.snps ~ "both",
                               TRUE ~ Haplotype))

Now, lets look at the remainder of the clusters of SNPs. These are SNPs that actually clustered with other variants.

twentyfive.percent.or.more.clusters.df %>% 
  filter(n > 1) %>% 
    ggplot(aes(x = factor(Tissue, levels = tissue_order), y = AF)) +
      geom_line(data = filter(twentyfive.percent.or.more.clusters.df, n > 1),
                aes(x = factor(Tissue, levels = tissue_order), y = AF, group = SNP, col = (cluster)), size = 1) +
      geom_ribbon(data = tissue.mean, aes(x=factor(Tissue, levels = tissue_order), ymin=AF-SD, ymax=AF+SD, group = Haplotype, fill = Haplotype), alpha=0.2, colour = NA) +
      facet_wrap(~cluster_size) + 
      scale_fill_manual(values=c("#424ef5", "#cf1919")) + 
      xlab("Tissue") + 
      ylab("Allele Frequency") +
      labs(fill="Haplotype", col = "Cluster") +
      theme_bw(20) +
      theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + 
      theme(strip.text.x = element_text(size = 12))

All of these look like reasonable clusters, so I’ll annotate the variant data frame with these new clusters (subclonal haplotypes). I’ll also try to guess if these are on the background of Genome-1 or Genome-2 if it’s possible.

# Genome-1 
cluster.1.snps = twentyfive.percent.or.more.clusters.df %>% 
  filter(cluster %in% c(1)) %>% 
  pull(SNP) %>% 
  unique(.)

# Genome-1 
cluster.2.snps = twentyfive.percent.or.more.clusters.df %>% 
  filter(cluster %in% c(19)) %>% 
  pull(SNP) %>% 
  unique(.)

# Genome-1 
cluster.3.snps = twentyfive.percent.or.more.clusters.df %>% 
  filter(cluster %in% c(16)) %>% 
  pull(SNP) %>% 
  unique(.)

# Genome-2
cluster.4.snps = twentyfive.percent.or.more.clusters.df %>% 
  filter(cluster %in% c(10)) %>% 
  pull(SNP) %>% 
  unique(.)

# Genome-2
cluster.5.snps = twentyfive.percent.or.more.clusters.df %>% 
  filter(cluster %in% c(20)) %>% 
  pull(SNP) %>% 
  unique(.)

# Genome-2
cluster.6.snps = twentyfive.percent.or.more.clusters.df %>% 
  filter(cluster %in% c(17)) %>% 
  pull(SNP) %>% 
  unique(.)

# Genome-2
cluster.7.snps = twentyfive.percent.or.more.clusters.df %>% 
  filter(cluster %in% c(3)) %>% 
  pull(SNP) %>% 
  unique(.)

# ? 
cluster.8.snps = twentyfive.percent.or.more.clusters.df %>% 
  filter(cluster %in% c(13)) %>% 
  pull(SNP) %>% 
  unique(.)

# ? 
cluster.9.snps = twentyfive.percent.or.more.clusters.df %>% 
  filter(cluster %in% c(4)) %>% 
  pull(SNP) %>% 
  unique(.)

# ? 
cluster.10.snps = twentyfive.percent.or.more.clusters.df %>% 
  filter(cluster %in% c(5)) %>% 
  pull(SNP) %>% 
  unique(.)

# ? 
cluster.11.snps = twentyfive.percent.or.more.clusters.df %>% 
  filter(cluster %in% c(18)) %>% 
  pull(SNP) %>% 
  unique(.)
  
# Annotate the labeled data frame with what's possible here
updated.haplotype.df = updated.haplotype.df %>%
  mutate(Background = case_when(SNP %in% cluster.1.snps ~ "unknown",
                                SNP %in% cluster.2.snps ~ "unknown",
                                SNP %in% cluster.3.snps ~ "unknown",
                                SNP %in% cluster.4.snps ~ "unknown",
                                SNP %in% cluster.5.snps ~ "unknown",
                                SNP %in% cluster.6.snps ~ "unknown",
                                SNP %in% cluster.7.snps ~ "unknown",
                                SNP %in% cluster.8.snps ~ "unknown",
                                SNP %in% cluster.9.snps ~ "unknown",
                                SNP %in% cluster.10.snps ~ "unknown",
                                SNP %in% cluster.11.snps ~ "unknown",
                                TRUE ~ Background)) %>% 
  mutate(Haplotype = case_when(SNP %in% cluster.1.snps ~ "cluster 1",
                               SNP %in% cluster.2.snps ~"cluster 2",
                               SNP %in% cluster.3.snps ~"cluster 3",
                               SNP %in% cluster.4.snps ~ "cluster 4",
                               SNP %in% cluster.5.snps ~ "cluster 5",
                               SNP %in% cluster.6.snps ~ "cluster 6",
                               SNP %in% cluster.7.snps ~ "cluster 7",
                               SNP %in% cluster.8.snps ~ "cluster 8",
                               SNP %in% cluster.9.snps ~ "cluster 9",
                               SNP %in% cluster.10.snps ~ "cluster 10",
                               SNP %in% cluster.11.snps ~ "cluster 11",
                               TRUE ~ Haplotype)) 

Haplotypes >= 5% and < 25%

Now, I’ll move onto the next bin. These are mutations that reach at least 5% in one tissue but weren’t part of the previous analysis. Most of these will be challenging to haplotype due to noise. I’ll only try to specify the clusters that seem the most clear cut.

twentyfive.percent.or.fewer.snps = expanded.df %>% 
  filter(!SNP %in% twentyfive.percent.or.more.snps) %>% 
  filter(Haplotype == "subclonal") %>% 
  filter(AF < 0.25 & AF >= 0.05) %>%
  pull(SNP) %>% 
  unique(.)

# Cluster the SNPs
twentyfive.percent.or.fewer.clusters.df = cluster.snps(list.of.snps = twentyfive.percent.or.fewer.snps, 
                                      snp.df = expanded.df, 
                                      n.clusters = 40)


# Mean of each tissue
twentyfive.percent.or.fewer.clusters.mean = twentyfive.percent.or.fewer.clusters.df %>% 
  group_by(Tissue, cluster, cluster_size, n) %>% 
  summarize(AF = mean(AF))
## `summarise()` has grouped output by 'Tissue', 'cluster', 'cluster_size'. You
## can override using the `.groups` argument.
# Plot the clusters
twentyfive.percent.or.fewer.clusters.df %>% 
  filter(n > 1) %>% 
    ggplot(aes(x = factor(Tissue, levels = tissue_order), y = AF)) +
      geom_line(aes(group = SNP)) +
      geom_line(data = filter(twentyfive.percent.or.fewer.clusters.mean, n > 1), aes(x = Tissue, y = AF, group = 1, col = (cluster)), size = 1) +
      geom_ribbon(data = tissue.mean, aes(x=Tissue, ymin=AF-SD, ymax=AF+SD, group = Haplotype, fill = Haplotype), alpha=0.2, colour = NA) +
      facet_wrap(~cluster_size) + 
      scale_fill_manual(values=c("#424ef5", "#cf1919")) + 
      xlab("Tissue") + 
      ylab("Allele Frequency") +
      labs(fill="Haplotype", col = "Cluster") +
      theme_bw(20) +
      theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + 
      theme(strip.text.x = element_text(size = 12))

Most of these aren’t very clear cut clusters. I took the 8 that looked the most promising.

promising.subclonal.clusters = c(20, 11, 21, 7, 27, 4, 5, 25)

twentyfive.percent.or.fewer.clusters.df %>% 
  filter(cluster %in% promising.subclonal.clusters) %>% 
    ggplot(aes(x = factor(Tissue, levels = tissue_order), y = AF)) +
      geom_line(aes(group = SNP)) +
      geom_line(data = filter(twentyfive.percent.or.fewer.clusters.mean, cluster %in% promising.subclonal.clusters), 
                aes(x = factor(Tissue, levels = tissue_order), y = AF, group = 1, col = (cluster)), size = 1) +
      geom_ribbon(data = tissue.mean, aes(x=factor(Tissue, levels = tissue_order), ymin=AF-SD, ymax=AF+SD, group = Haplotype, fill = Haplotype), alpha=0.2, colour = NA) +
      facet_wrap(~cluster_size) + 
      scale_fill_manual(values=c("#424ef5", "#cf1919")) + 
      xlab("Tissue") + 
      ylab("Allele Frequency") +
      labs(fill="Haplotype", col = "Cluster") +
      theme_bw(20) +
      theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + 
      theme(strip.text.x = element_text(size = 12))

What’s left to cluster?

We were able to cluster a reasonable number of the high-frequency mutations. How many mutations haven’t been haplotyped yet?

# Still labeled as 'subclonal' 
mutations.with.no.info = updated.haplotype.df %>% 
  filter(Background == 'subclonal') %>% 
  pull(SNP) %>% 
  unique()
print(paste("There are about", length(mutations.with.no.info), "that are still totally un-haplotyped."))
## [1] "There are about 372 that are still totally un-haplotyped."
# Never above 5% - nearly impossible to haplotype 
mutations.never.above.5perc = updated.haplotype.df %>% 
  filter(Background == 'subclonal') %>% 
  group_by(SNP) %>% 
  summarise(Max = max(AF)) %>% 
  filter(Max < 0.05) %>% 
  pull(SNP) %>% 
  unique()

# Mutations that have some info
mutations.with.info = updated.haplotype.df %>% 
  filter(Background != 'subclonal') %>% 
  pull(SNP) %>% 
  unique()
print(paste("There are", length(mutations.with.info), "that have been clustered."))
## [1] "There are 159 that have been clustered."
# Mutations that are left, but are above 5% some of the time. 
mutations.left.to.haplotype = mutations.with.no.info[which(!mutations.with.no.info %in% mutations.never.above.5perc)]
print(paste("There are still about", length(mutations.left.to.haplotype), "that can reasonably be haplotyped."))
## [1] "There are still about 163 that can reasonably be haplotyped."
# What's the distribution of the frequency of these mutations?
updated.haplotype.df %>% 
  filter(SNP %in% mutations.left.to.haplotype) %>% 
  ggplot(aes(x = AF)) + 
    geom_histogram(bins = 30) + 
    theme_bw()

All Current Haplotype Clusters

Here are the frequency of all current haplotypes in the 15 tissue samples.

haplotypes.label = updated.haplotype.df %>% 
  select(SNP, Haplotype, Background) %>% 
  distinct()

haplotype.order = c("cluster 1",
                    "cluster 2",
                    "cluster 3",
                    "cluster 4",
                    "cluster 5",
                    "cluster 6",
                    "cluster 7",
                    "cluster 8",
                    "cluster 9",
                    "cluster 10",
                    "cluster 11",
                    "cluster 12",
                    "cluster 13",
                    "cluster 14", 
                    "cluster 15",
                    "cluster 16",
                    "cluster 17",
                    "cluster 18", 
                    "cluster 19", 
                    "genome-1-1")
  

# Expand the mutations to have a frequency for every tissue.
expanded.df = updated.haplotype.df %>% 
  select(SNP, Tissue, AF) %>% 
  pivot_wider(names_from = "Tissue", values_from = "AF", values_fill = 0) %>% 
  pivot_longer(cols = !SNP, names_to = "Tissue", values_to = "AF") %>% 
  left_join(., select(labeled.df, c("SNP", "Tissue", "DP")), by = c("SNP", "Tissue")) %>% 
  mutate(DP = if_else(is.na(DP), 0, DP)) %>% 
  left_join(., haplotypes.label, by = "SNP") 

# Get the mean frequency of the major genomes 
tissue.mean = updated.haplotype.df %>% 
  filter(Haplotype %in% c("genome-1", "genome-2")) %>% 
  group_by(Tissue, Haplotype) %>% 
  summarize(AF.mean = mean(AF, na.rm = TRUE), 
            SD = sd(AF, na.rm = TRUE),
            N = n()) %>% 
  mutate(SE = SD / sqrt(N),
         Lower.CI = qt(1 - (0.05 / 2), N - 1) * SE,
         Upper.CI = qt(1 - (0.05 / 2), N - 1) * SE) %>% 
  rename("AF" = AF.mean, "Genotype" = Haplotype)

haplotype.mean = expanded.df %>% 
  group_by(Tissue, Haplotype) %>% 
  summarize(AF = mean(AF))

expanded.df %>% 
    filter(!Haplotype %in% c("fixed", "both", "subclonal", "genome-1", "genome-2")) %>% 
    ggplot(aes(x = factor(Tissue, levels = tissue_order), y = AF)) +
      geom_line(aes(group = SNP)) +
      geom_line(data = filter(haplotype.mean, !Haplotype %in% c("fixed", "both", "subclonal", "genome-1", "genome-2")), 
                aes(x = factor(Tissue, levels = tissue_order), y = AF, group = 1, col = (Haplotype)), size = 1) +
      geom_ribbon(data = tissue.mean, aes(x=factor(Tissue, levels = tissue_order), ymin=AF-SD, ymax=AF+SD, group = Genotype, fill = Genotype), alpha=0.2, colour = NA) +
      facet_wrap(~factor(Haplotype, levels = haplotype.order)) + 
      scale_fill_manual(values=c("#424ef5", "#cf1919")) + 
      xlab("Tissue") + 
      ylab("Allele Frequency") +
      labs(fill="Genotype", col = "Cluster") +
      theme_bw(20) +
      theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + 
      theme(strip.text.x = element_text(size = 12))

# ggsave("../../results/figures/all-current-clusters.png", width = 20, height = 12, dpi = 300)

Looking closely at the haplotypes, I think there are two amendments:

  1. It looks like cluster 17 and 18 aren’t clear haplotypes, so for now, I’ll remove these.
  2. Cluster 11 and 19 are probably the same cluster, certainly 19 is on the background of cluster 11.
updated.haplotype.df = updated.haplotype.df %>%
  mutate(Haplotype = case_when(Haplotype %in% c("cluster 17", "cluster 18") ~ "subclonal",
                               Haplotype %in% c("cluster 19") ~ "cluster 11",
                               TRUE ~ Haplotype)) 

Finally, it’s difficult to haplotypes these particular groups because there aren’t a lot of tissues where we can establish a frequency correlation.

write_csv(updated.haplotype.df, output.path)

Figure 2 of the Paper

For the second figure of the paper, we want to illustrate the two major genotypes and the ancestral mutations in one of the two primary samples (SSPE 1 and SSPE 2). In addition, we want to illustrate some of the major driver mutations that we find on the background of these haplotypes.